{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Julia for Data Science - Data\n",
    "Prepared by [@nassarhuda](https://github.com/nassarhuda)! 😃\n",
    "\n",
    "In the next few notebooks, we will discuss why *Julia* is the tool you want to use for your data science applications.\n",
    "\n",
    "We will cover the following:\n",
    "1. Reading and writing files\n",
    "1. DataFrames\n",
    "1. RDatasets\n",
    "1. FileIO\n",
    "1. File types\n",
    "\n",
    "\n",
    "### Data: Build a strong relationship with your data.\n",
    "Every data science task has one main ingredient, the _data_! Most likely, you want to use your data to learn something new. But before the _new_ part, what about the data you already have? Let's make sure you can **read** it, **store** it, and **understand** it before you start using it.\n",
    "\n",
    "Julia makes this step really easy with data structures and packages to process the data, as well as existing functions that are readily usable on your data. \n",
    "\n",
    "The goal of this first part is get you acquainted with some Julia's tools to manage your data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Reading and writing to files is really easy in Julia.** <br>\n",
    "\n",
    "To see this, let's download a csv file from github that we can work with.\n",
    "\n",
    "Note: `download` depends on external tools such as curl, wget or fetch. So you must have one of these."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P = download(\"https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv\",\"programminglanguages.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use shell commands like `ls` in Julia by preceding them with a semicolon."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    ";ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And there's the *.csv file we downloaded!\n",
    "\n",
    "By default, `readdlm` will fill an array with the data stored in the input .csv file. If we set the keyword argument `header` to `true`, we'll get a second output array for the headers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using DelimitedFiles\n",
    "P,H = readdlm(\"programminglanguages.csv\",header=true)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use different delimiters with the function `readdlm`.\n",
    "\n",
    "To write to files, we can use `writedlm`. <br>\n",
    "\n",
    "Let's write this same data to a file with a different delimiter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "writedlm(\"programming_languages_data.txt\", P, '-')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now check that this worked using a shell command to glance at the file,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    ";head -10 programming_languages_data.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and also check that we can use `readdlm` to read our new text file correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P_new_delim = readdlm(\"programming_languages_data.txt\", '-');\n",
    "P == P_new_delim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DataFrames! \n",
    "*Shout out to R fans!*\n",
    "One other way to play around with data in Julia is to use a DataFrame.\n",
    "\n",
    "This requires loading the `DataFrames` package.\n",
    "\n",
    "Run this command to install all the packages used in the \"Julia for Data Science\" project -- (those packages are listed in this file: [`Project.toml`](/edit/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-for-data-science/Project.toml)):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "] instantiate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using DataFrames\n",
    "df = DataFrame(year = P[:,1], language = P[:,2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can access columns by header name, or column index.\n",
    "\n",
    "In this case, `df[1]` is equivalent to `df.year` or `df[!, :year]`.\n",
    "\n",
    "Note that if we want to index columns by header name, we precede the header name with a colon. In Julia, this means that the header names are treated as *symbols*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.year"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**`DataFrames` provides some handy features when dealing with data**\n",
    "\n",
    "First, it uses julia's \"missing\" type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = missing\n",
    "typeof(a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what happens when we try to add a \"missing\" type to a number"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a + 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RDatasets\n",
    "\n",
    "We can use RDatasets to play around with pre-existing datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using RDatasets\n",
    "iris = dataset(\"datasets\", \"iris\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that data loaded with `dataset` is stored as a DataFrame. 😃"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "typeof(iris) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`DataFrames` provides the `describe` can give you quick statistics about each column in your dataframe "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "describe(iris)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can create your own dataframe quickly as follows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "foods = [\"apple\", \"cucumber\", \"tomato\", \"banana\"]\n",
    "calories = [missing,47,22,105]\n",
    "typeof(calories)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Statistics\n",
    "mean(calories)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`missing` ruins everything! 😑"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean(skipmissing(calories))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In fact, `describe' will drop these values too"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "describe(DataFrame(c=calories))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `typeof(calories)` is `Vector{Union{Missing, Int64},1}`\n",
    "\n",
    "If we want to replace all `missing` values a default value, we can do it like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "newcalories = replace(calories, missing => 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's create a `DataFrame` that shows foods and their calories from two `DataArray`s!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataframe_calories = DataFrame(item=foods,calories=calories)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Let's generate a second `DataFrame` that shows foods and their prices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prices = [0.85,1.6,0.8,0.6,]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataframe_prices = DataFrame(item=foods,price=prices)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also `join` these two dataframes together because they share a common column, `item`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DF = join(dataframe_calories,dataframe_prices,on=:item)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we used the keyword argument `on` to say that we wanted to join these dataframes together based on the `item` column."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### FileIO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using FileIO\n",
    "julialogo = download(\"https://avatars0.githubusercontent.com/u/743164?s=200&v=4\",\"julialogo.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, let's check that this download worked!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    ";ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's load the Julia logo, stored as a .png file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X1 = load(\"julialogo.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see below that Julia stores this logo as an array of colors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@show typeof(X1);\n",
    "@show size(X1);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### File types\n",
    "In Julia, many file types are supported so you do not have to transfer a file you got from another language to a text file before you read it.\n",
    "\n",
    "*Some packages that achieve this:*\n",
    "MAT CSV NPZ JLD FASTAIO\n",
    "\n",
    "\n",
    "Let's try using MAT to write a file that stores a matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using MAT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "A = rand(5,5)\n",
    "matfile = matopen(\"densematrix.mat\", \"w\") \n",
    "write(matfile, \"A\", A)\n",
    "close(matfile)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now try opening densematrix.mat with MATLAB!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "newfile = matopen(\"densematrix.mat\")\n",
    "read(newfile,\"A\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "names(newfile)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "close(newfile)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 1.0.5",
   "language": "julia",
   "name": "julia-1.0"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.0.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
